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Invasive cervix cancer (ICC) is the third most common malignant tumor in women and 
human papillomavirus 16 (HPV16) causes more than 50% of ICC. DNA methylation 
is a covalent modification predominantly occurring at CpG dinucleotides and increased 
methylation across the HPV16 genome is strongly associated with ICC development. Next 
generation (Next Gen) sequencing has been proposed as a novel approach to determine 
DNA methylation. However, utilization of this method to survey CpG methylation in the 
HPV16 genome is not well described. Moreover, it provides additional information on 
methylation "haplotypes." In the current study, we chose 12 random samples, amplified 
multiple segments in the HPV16 bisulfite treated genome with specific barcodes, 
inspected the methylation ratio at 31 CpG sites for all samples using lllumina sequencing, 
and compared the results with quantitative pyrosequencing. Most of the CpG sites 
were highly consistent between the two approaches (overall correlation, r = 0.92), thus 
verifying that Next Gen sequencing is an accurate and convenient method to survey 
HPV16 methylation and thus can be used in clinical samples for risk assessment. 
Moreover, the CpG methylation patterns (methylation haplotypes) in single molecules 
identified an excess of complete-and non-methylated molecules and a substantial amount 
of partial-methylated ones, thus indicating a complex dynamic for the mechanisms of 
HPV16 CpG methylation. In summary, the advantages of Next Gen sequencing compared 
to pyrosequencing for HPV genome methylation analyses include higher throughput, 
increased resolution, and improved efficiency of time and resources. 

Keywords: human papillomavirus, methylation, next generation sequencing, CpG methylation, methylation 
haplotypes 



INTRODUCTION 

Invasive cervical cancer (ICC) is the third most common malig- 
nant tumor in women and is caused by persistent infection of 
oncogenic human papillomavirus (HPV) (Jemal et al., 2011), 
especially type 16, which accounts for greater than 50% of all 
ICC (Schiffman et al, 2007; Li et al, 2011; Schiffman and 
Wentzensen, 2013). Recent data indicates that multiple regions of 
HPV16 and other oncogenic HPV type genomes show increasing 
CpG methylation patterns among normal, cervical intraepithe- 
lial neoplasia (CIN), and cancer tissues, respectively (Badal et al., 
2003; Kalantari et al, 2004, 2009, 2010, 2014; Hong et al, 2008; 
Brandsma et al, 2009, 2014; Ding et al, 2009; Fernandez et al., 
2009; Fernandez and Esteller, 2010; Piyathilake et al, 2011; Sun 
et al., 201 1; Wentzensen et al., 2012; Lorincz et al., 2013; Mirabello 
et al., 2013). Thus, assays for quantitation of CpG methylation 
of oncogenic HPV genomes in general and HPV16 in particu- 
lar, indicate that methylation is a promising biomarker for ICC 
development (Clarke et al., 2012). Therefore, a fast, accurate, 
and high-throughput approach to survey DNA methylation 



in HPV16 should facilitate ICC prevention, diagnosis, and 
treatment. 

So far, the most widely used method for HPV DNA methy- 
lation investigation is bisulfite treatment followed by sequencing 
(Bhattacharjee and Sengupta, 2006), MassArray (Ehrich et al., 
2005), SNPshot (Kaminsky and Petronis, 2009), or in particular 
pyrosequencing (Tost and Gut, 2007a,b; Dejeux et al., 2009) (for 
review, see Clarke et al., 2012). Despite the accuracy of CpG quan- 
titation by pyrosequencing, it can only provide a relatively short 
read for each assay per sample and thus is time and labor inten- 
sive for testing multiple sites in large numbers of samples, which 
limits the incorporation of DNA methylation in clinical studies. 
Moreover, all these approaches constrain the ability to detect the 
methylation pattern at single-DNA-molecule resolution, which is 
critical for investigating methyltransferase dynamics. Although 
the cloning-sequencing approach after bisulfite treatment can 
provide some insight into this issue, the typically low num- 
ber of clones analyzed (<10) and the high costs limits this 
approach. 
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Next generation sequencing can yield millions of single 
molecule reads and has been used to determine DNA methyla- 
tion (Taylor et al., 2007; Bibikova and Fan, 2010; Laird, 2010; Feng 
et al, 2011; Kim et al, 2011; Komori et al, 2011; Ku et al., 2011; 
Nejman et al., 2014). Supplemented with DNA barcoding tech- 
nology, which incorporates a unique index sequence into each 
PCR segment, this approach can provide a rapid way to simulta- 
neously determine DNA methylation at the single-molecule level 
in large numbers of samples. However, the accuracy and valid- 
ity of this approach needs further evaluation, especially in viral 
genomes such as HPV16. 

In the present study, we randomly chose 12 samples with 
quantitation of CpG methylation within the HPV16 genome 
by pyrosequencing and performed amplification with primers 
containing barcodes specific for each sample. After all sam- 
ples were pooled and purified, the PCR products were deep 
sequenced, analyzed, and the results were compared with CpG 
methylation determined by pyrosequencing. The methylation 
ratio for most CpG sites was highly correlated with those 
from pyrosequencing, which indicated that Next Gen sequenc- 
ing of bisulfite treated cervical cells infected with HPV16 was 
an accurate method of quantitating CpG methylation. Moreover, 
the single molecule analyses provides a "methylation haplo- 
type" and indicated an excess of full and non-methylated 
molecules in nearly all samples and a lower proportion of 
partially methylated molecules in most samples, thus reveal- 
ing a complex and mosaic methylation pattern in the HPV16 
genome. 

MATERIALS AND METHODS 
CERVICOVAGINAL SAMPLES 

Twelve exfoliated cervical samples were randomly chosen from a 
previously reported nested case-control study of HPV16-positive 
cervical intraepithelial neoplasia grade 3 (CIN3) and HPV16- 
positive cervical samples that cleared infection (Mirabello et al., 
2012). The lab was blinded to all clinical information. All sam- 
ples contained HPV16 and the quantitation of CpG methylation 
of the HPV16 genome had been determined by pyrosequencing, 
as described (Mirabello et al., 2012). The study was designed to 
test samples for quantitation of CpG methylation to evaluate Next 
Gen sequencing compared to pyrosequencing prior to embark- 
ing on using this method on precious well-characterized samples 
from epidemiological studies. 

ASSAY DESIGN 

Since LI, L2, and E2 ORF regions showed differential methyla- 
tion among disease groups (Mirabello et al., 2012), these three 
regions and the most significant CpG sites within them were 
chosen for the current study. Primers for PCR were designed 
using MethPrimer (Li and Dahiya, 2002) (http://www.urogene. 
org/methprimer/indexl.html). In total, 3 segments in LI, 4 in L2, 
and 1 in E2 were included in the current study (Table 1A), and in 
total, 3 1 CpG sites were surveyed. Each primer was labeled by a 
unique barcode and 5' and 3' padding sequence (Table IB) and 
synthesized by Integrated DNA Technologies (IDT, Coralville, 
IA). A map of all 112 CpG sites in the HPV16 7906 bp reference 
genome is shown in Figure 1 in the review by Clarke et al. (2012). 



Table 1A | Next Gen sequencing assays of bisulfite treated HPV16 
DNA in clinical samples. 



Assay name 


#CpG 


CpG position 


Length (bp) 


L1_1 


4 


5602, 5608, 5611, 5617 


114 


L1_2 


4 


7034, 7091, 7136, 7145 


172 


L1_7 


2 


6650, 6581 


167 


|_2 1 


5 


4240 4249 4261 4270 
4277 


130 


L2_2 


3 


4427, 4437, 4441 


89 


L2_4 


3 


5128, 5173, 5179 


123 


L2_5 


1 


5378 


166 


E2_1 


5 


3412, 3415, 3417, 3433, 
3436 (3448, 3462, 3473, 
3496) 


169 



(), These sites were present in the NGS data but not PSQ. 



Table 1B | Description of barcoded primers* for assays shown in this 



table. 


Primer 


5' pad 


3' pad 


Primer target 


name 


(LP) 


(RP) 


sequence (5 -3 ) 


16E2. 


_1F 


ACT 


GCAG 


TTAGGTAGTATTTGGTTAATTATTT 


16E2. 


_1R 


ACT 


GCAG 


ATTAAAACACTATCCACTAAATCTCTATAC 


16L1. 


_1F 


TAC 


GTAC 


TAATATATAATTATTGTTGATGTAG GTGAT 


16L1. 


_1R 


TAC 


GTAC 


AACAACCAAAAAAACATCTAAAAAA 


16L1. 


2F 


ACT 


GACG 


TTTGTAGATTTAGATTAG I I I I I I I IAGGA 


16L1. 


2R 


ACT 


GACG 


TTCAACATACATACAATACTTACAACTTAC 


16L1. 


7F 


TAC 


GATG 


ATGTAG I I I I I GAAGTAGATATGGTAGTA 


16L1. 


7R 


TAC 


GATG 


AATTACCTCTAATACCCAAATATTCAA 


16L2. 


_1F 


ATC 


GACG 


I I I I IGI I IGI I IGI I IGI I I I I 


16L2. 


_1R 


ATC 


GACG 


ACATATACCTACCTATTTACATATTTTATA 


16L2. 


2F 


ATC 


GACG 


TATGGAAGTATGGGTGTAI I I I I 


16L2. 


2R 


ATC 


GACG 


ATTCCCAATAAAATATACCCAATAC 


16L2. 


4F 


ATC 


GTAC 


TTTTGGATATAGTTGTTTTATATAGGTTAG 


16L2. 


4R 


ATC 


GTAC 


C CTTAACACCTATAAATTTTC C ACTAC 


16L2. 


5F 


ATC 


GTCA 


TTGTAGAAGAAATAGAATTATAAATTATAA 


16L2. 


5R 


ATC 


GTCA 


AAAAATATAAAAAATAC AAATAATAC C 



"Each primer consisted of 3 to 3 : 3 bp (LP) - 8 bp Barcode -4 bp (RP) - Primer 
Target Sequence. 

BISULFITE TREATMENT, PCR. PURIFICATION, AND DEEP SEQUENCING 

DNA samples containing HPV16 DNA were treated with freshly 
prepared bisulfite using the EZ DNA methylation kit (Zymo 
Research, Orange, CA) according to the manufacturer's proto- 
col. Upon bisulfite treatment, unmethylated C's are converted 
into U's, which are then converted to T's by Taq polymerase 
during PCR amplification; methylated C's remain unmodified. 
Thus, in the CpG sequence a "C" represents a methylated CpG, 
whereas a "T" represents an unmethyaled CpG. All segments were 
amplified by HotStart-IT FideliTaq DNA polymerase (United 
States Biochemicals, Cleveland, OH). After validating size and 
intensity in a 3% agarose gel, each PCR product was pooled 
in equal proportions, separated by electrophoresis, and isolated 
from the gel. After precipitation by isopropanol and washed 
by 70% ethanol, the PCR products were ligated with adaptor 
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FIGURE 1 | The summary correlation of 27 CpG methylation sites between pyrosequencing and Next Gen sequencing. The x-axis is percent 
methylation by pyrosequencing and the y-axis is percent methylation by Next Gen sequencing. Each CpG site is plotted for each subject. 



(library construction) and sequenced on an Illumina HiSeq 2000 
(NG sequencing) within the Albert Einstein College of Medicine, 
Epigenetics Core Facility (Bronx, NY). 

METHYLATION RATIO ANALYSIS 

The obtained sequences were filtered by prinseq (Schmieder and 
Edwards, 2011) with average Phred score (Cock et al., 2010) not 
less than 20. The barcodes for each sample were split and cut 
by FastX kit (http://hannonlab.cshl.edu/fastxtoolkit/index.html). 
After alignment with the reference HPV 16 genome by bowtie 
(Langmead et al, 2009), the methylation status for each molecule 
was determined by Bismark (Krueger and Andrews, 2011). For 
each CpG site, the methylation ratio is calculated by the for- 
mula: number of C reads divided by the sum of C and T reads 
at each CpG site. The pyrosequencing result for each site was 
determined on a PSQ96 ID Pyrosequencer (Qiagen, Valencia, 
CA) at the Albert Einstein College of Medicine, Genomics Core 
Facility (Bronx, NY) and the readout was percent methylation, 
as previously described (Mirabello et al., 2012). The correlation 
between NG sequencing and pyrosequencing was performed by 
linear regression in SPSS 16.0 (SPSS Inc., Chicago, IL) and the 
null hypothesis was rejected when P < 0.05. We have deposited 
the read sequences in the NCBI Sequence Read Archive (SRA) 
database, accession number SRP04098 1 . 

SINGLE MOLECULE ANALYSIS 

The methylation pattern for each single molecule and the counts 
of each pattern were obtained by an in-house script (available 



on request). Briefly, the Bismark output, which gave the methy- 
lation state for each CpG site in each molecule, was parsed, and 
the methylation pattern of each DNA molecule was then recon- 
structed based on the unique read name it was assigned by the 
sequencer. Finally, the prevalence of each unique methylation pat- 
tern was counted, for each sample in each assay. The expected 
probabilities were constructed in two steps. First the singular 
probabilities of each site being methylated and unmethylated were 
calculated, by counting the proportions of molecules in each 
state at each site. Multiplying the appropriate singular probabil- 
ities, under the assumption that CpG sites would be methylated 
independently, produced an expected probability for each methy- 
lation pattern. A x. 2 goodness of fit test was then performed to 
compare observed and expected probabilities for methylation pat- 
terns, where the observed probabilities were the proportions of 
each detected pattern, calculated from their counts. 

RESULTS 

SEQUENCING DATA STATISTICS 

In total, 192.2 million reads 95 bp long were obtained from NG 
sequencing and 53.4 million (27.8%) possessed an average Phred 
score above 20 and were used for this analysis (Ewing and Green, 
1998). 41.7 million reads (78.1%) were observed to contain one 
of the incorporated barcodes without mismatch and assigned to a 
corresponding sample for further analysis. Except one segment in 
the L2 gene, most segments included ~4-8 million reads (Figure 
S1A). For each sample, the read count varied from 2 to 6 million 
(Figure SIB). Most CpG sites (21/27) were covered by 0.6 to 1.8 



www.frontiersin.org 



June 2014 | Volume 5 | Article 150 | 3 



Sun et al. 



NGS HPV DNA methylation analyses 




0.40 
0.20 
0.00 



0 0.5 
(2=0.87, P<10-5 

r 4249 



li , » 

0 0.2 0.4 

(2=0.52, P=0.009 



1.00 , 

0.50 

0.00 



4277 



0 05 
(2=0.92. P<10-5 

, 4441 



) 0.5 

(2=0.88, P<10-5 



5179 



0 0.5 

(2=0.99, P<10-6 



5608 



0 0.5 
r2=0.97, P<10-6 
1.00 I 4261 

0.50 



0 05 1 

(■2=0.83, P=0 000046 

4427 



0 0.5 1 

(2=0.67, P=0.001 

oeo 4270 

0.40 ^ ,* 

0.20 • 

0.00 ^ 1 1 ' 

0 0.2 0.4 0.6 

r2=0.90, P<10-5 



0.50 I 

0.00 



(2=0.92, P<10-5 

5128 



0.50 % s 

0.00 •— — 



1.50 
1.00 
0.50 
0 00 



0 0.5 
(2=0.95, P<10-e 

5378 



0 



C 5 



(2=0 94. P<10-6 

150 5611 

1.00 

0.50 I , • 
0.00 L -* ' 



1 00 
0 50 
0 00 

1.00 
0.50 
0 00 

1.00 
0.50 

0.00 

1.00 r 



4437 



0 0.5 

(2=0.67, P=0.001 

5173 



(2=0 97, P<10-6 

5602 



r2=0.92, P<10-5 
5617 



0 50 
000 * 



(2=0 97, P<10-6 



(2=0.91. P<10-5 



(2=0 99, P<10-{ 



0.60 
0.40 
0.20 
0.00 



6581 



(2=0.98, P=0 001 



6650 



0 20 
0 00 



0.30 
0.20 
0.10 
0.00 



7034 



(2=0.35, P=0 069 



0 02 0.4 

(2=0.18, P=0 19 



0.60 
0.40 
0.20 
0.00 



7091 



0.40 
0.20 



7136 



(2=0.53, P=0.017 



0 40 

0.20 
0.00 



(2=0.71, P=0.0005 



7145 



0 0.2 0.4 0.6 
(2=0.60, P=0.005 



FIGURE 2 | The correlation of 27 individual CpG methylation sites 
between pyrosequencing and Next Gen sequencing. Each plot displays 
one CpG site and the location is indicated at the top. The x-axis indicates 
lllumina sequencing, the y-axis pyrosequencing for percent methylation. 



million reads (Figure SIC), in total. Although three fragments did 
not amplify as robustly and had less reads (i.e., Ll_2, L2_2, and 
L2_5 assays containing CpG sites 7034, 7091, 7136, 7145; 4427, 
4437, 4441; and 5378, respectively), the correlation between CpG 
sites within these fragments between the PSQ and NGS assays had 
reasonable agreement (see Figure 2). 

METHODOLOGICAL COMPARISON 

Among these 31 CpG sites, 27 had been analyzed by pyrose- 
quencng and the two results were compared. Using linear regres- 
sion, most sites showed significant correlation between the two 
methods with an overall correlation of 0.92 (Figure 1). Only 
two CpG sites (positions 6650 and 7034) were poorly correlated 
(P = 0.069 and 0.19, respectively, Figure 2). These results indi- 
cated that next generation sequencing was an appropriate method 
for determining CpG methylation in the HPV16 genome, yield- 
ing results that were highly correlated with a well-established 
pyrosequencing technique. 

SINGLE-MOLECULE RESOLUTION OF CPG METHYLATION 

The methylation pattern for each single molecule was deter- 
mined for each sample across six regions of the HPV16 genome. 
A substantial proportion (10-80%) of molecules in the six assays 
were partially methylated (possessing a mixture of methylated 
and unmethylated sites), and the distributions of patterns were 
varied. However, the site-wise proportions of methylation for 
a given sample in a given assay were similar and the distribu- 
tion of patterns in each molecule did exhibit dependence on 
that methylation level. To address this issue, we calculated the 
expected frequencies of all possible methylation patterns in each 
assay and compared them with the observed patterns (Figure 3). 
In most samples, a relative excess of none- and/or fully methy- 
lated molecules was observed (Figure 3). In contrast, despite their 
high prevalences, there was a relative absence of partial methy- 
lated molecules (Figure 3). As a consequence, most of the samples 
yielded a significant P-value (p < 0.05), thus indicating that CpG 
sites are not likely to be methylated/demethylated in an indepen- 
dent fashion, but that a more complex process determines the 
methylation state within a region of the HPV16 genome. 

DISCUSSION 

In the present study, we used lllumina Next Gen single molecule 
sequencing to survey the methylation of the HPV16 genome in 
12 samples and compared the result with pyrosequencing, which 
provides an average percent methylation for each CpG site. A con- 
sequence of using Next Gen sequencing technology was the ability 
to investigate the methylation pattern at the single-molecule level. 
Our results demonstrate that the Next Gen bisulfite sequencing 
protocol was comparable to the well-established pyrosequencing- 
based method for assaying HPV16 genome methylation (over- 
all correlation = 0.92). The ability to analyze single molecules 
allowed us to test whether the process of CpG methylation at 
specific sites was independent. Thus with known percent CpG 
methylation at each site we compared the observed with the 
predicted methylation haplotypes. There was a significant dif- 
ference indicating that CpG methylation of multiple CpG sites 
on a given fragment of the viral genome is not an independent 
process. In addition, utilizing DNA barcoding, multiple samples 



can be pooled together and run in a single sequencing reaction 
and the result for each single sample can be distinguished without 
ambiguity. Thus, the high-throughput nature of this technique 
facilitates large-scale clinical and epidemiological studies. 
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Several CpG sites presented a relatively low read count com- 
pared with others. A careful inspection indicated that these were 
located in the middle of the amplicon and would thus appear 
at the end of the read in both strand. These end regions usually 
have a low base call quality, due to constraints of the sequencing 



technology and are, therefore, often filtered out prior to analysis. 
The Bismark-based analysis method utilized in this study doesn't 
exploit the gain in base call quality that can be obtained by 
comparing overlapping regions of paired-end reads, therefore 
the latter problem could be surmounted by improvements 






ii 




FIGURE 3 | Continued 
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FIGURE 3 | The distribution of the differences between observed and 
expected frequencies for each "methylation haplotype" in assay L1 1 
(A), L1_7 (B), L2_1 (C), L2_2 (D), L2_4 (E), and E2_1 (F). Each boxplot 
indicate one "methylation haplotype" combination. The green and red bars 
denote positive and negative value, respectively. (A) "Methylation haplotype" 



frequency differences in assay L1_1. (B) "Methylation haplotype" frequency 
differences in assay L1_7. (C) "Methylation haplotype" frequency differences 
in assay L2_1. (D) "Methylation haplotype" frequency differences in assay 
L2_2. (E) "Methylation haplotype" frequency differences in assay L2_4. (F) 
"Methylation haplotype" frequency differences in assay E2_1. 



in the bioinformatics pipeline. Alternatively, we could shorten 
the amplicon, or generate longer reads, to facilitate equal read 
counts for all sites in an assay. However, there are limitations on 
shortening or lengthening the amplicons, due to the composition 
of sequences surrounding CpG sites that are used for the primers. 
Nevertheless, it is anticipated that with advances in Next Gen 
sequencing technologies longer fragment reads and improved 
quality will facilitate the use of the methods described in this 
report. 

Despite the high consistency between next generation 
sequencing and pyrosequencing results, two CpG sites, 6650 and 
7034, failed to yield a significant correlation. Both CpG sites had a 
relatively high read count (> 1.14 and 0.15 million, respectively), 
thus indicating that the low correlation was not due to stochastic 
effects. A detailed audit identified one sample showing remark- 
able discrepancy between the two approaches in both CpG sites 
(see red circle in Figure 2). When this sample was removed from 
analysis, significant correlations were obtained for both CpG 
sites (r 2 = 0.85, P = 0.00006 for 6650 and r 2 = 0.76, P = 0.001 
for 7034), which further verified the strong consistency between 
the two approaches. However, the reason for the discrepancy in 
this sample remains unclear. Possibilities include a nucleotide 
variation at this position, which would skew the results, and/or 
bisulfite- and PCR-induced artifacts. In addition, potential PCR 
biases could also result in lower than expected read numbers. 

Examining methylation patterns on individual molecules of 
DNA is essential to understand the methyltransferase dynamics 



and the mechanism(s) by which methylation interacts with 
oncogenic HPV viral natural history and progression to cervi- 
cal cancer. Based on cloning approaches, most previous stud- 
ies suggested that methylation in promoter regions was more 
likely to be one (fully methylated) or zero (fully unmethy- 
lated) in human genomic DNA (Oates et al, 2006) and 
HPV genomes (Kalantari et al, 2004, 2009, 2010; Turan 
et al, 2006, 2007; Ding et al, 2009; Fernandez et al., 2009). 
However, due to the low number of single molecules ana- 
lyzed (i.e., clones sequenced), the real composition might 
be substantially different. Next Gen sequencing, which can 
survey millions of molecules at the same time, can pro- 
vide more insight into this issue. In our results, despite 
the relative excess of fully methylated and fully unmethy- 
lated molecules, a substantial proportion of molecules dis- 
played a partial methylation pattern, which verified previous 
observations (Taylor et al., 2007) and hinted at the com- 
plex regulation of the methylation process. Whether there are 
dynamic changes in CpG methylation patterns remains to be 
determined. 
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